
Fix legacy endpoint backwards compatibility for _use_legacy_endpoint feature flag #45727

Open
slister1001 wants to merge 16 commits into Azure:main from slister1001:fix/legacy-endpoint-backwards-compat

Conversation

@slister1001
Member

Fix 7 bugs that prevented the _use_legacy_endpoint=True flag from being fully backwards compatible with the pre-sync-migration behavior:

  1. Add bidirectional metric name mapping in evaluate_with_rai_service_sync() and evaluate_with_rai_service_sync_multimodal(): legacy endpoint gets hate_fairness, sync endpoint gets hate_unfairness, regardless of caller input.

  2. Skip _parse_eval_result() for legacy endpoint in _evaluate_query_response(): legacy returns a pre-parsed dict from parse_response(), return directly.

  3. Restore whole-conversation evaluation in _evaluate_conversation() when legacy endpoint: send all messages in a single call (pre-migration behavior) instead of per-turn evaluation.

  4. Remove dead effective_metric_name variable in _evaluation_processor.py: metric normalization is now handled at the routing layer.

  5. Pass evaluator_name in red team evaluation processor for telemetry.

  6. Add use_legacy_endpoint parameter to Foundry RAIServiceScorer and forward it to evaluate_with_rai_service_sync(). Remove redundant manual metric name mapping (now handled by routing layer).

  7. Update metric_mapping.py comment to document the routing layer approach.

Tests:

  • 9 new unit tests in test_legacy_endpoint_compat.py covering query/response, conversation, metric enum, and _parse_eval_result paths
  • 4 new unit tests in test_content_safety_rai_script.py covering routing, metric name mapping for both endpoints
  • 5 new e2e tests in test_builtin_evaluators.py covering all content safety evaluators with legacy endpoint, key format parity, and conversation mode
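
The routing-layer normalization in item 1 can be sketched roughly as follows. The dict name `_SYNC_TO_LEGACY_METRIC_NAMES` is the one used later in this PR, but `normalize_metric` and the standalone shape here are illustrative, not the SDK's actual code:

```python
# Sketch only: one entry per metric whose legacy annotation name differs.
_SYNC_TO_LEGACY_METRIC_NAMES = {
    "hate_unfairness": "hate_fairness",  # sync name -> legacy annotation name
}
# Inverted once at module level so it is not rebuilt on every call.
_LEGACY_TO_SYNC_METRIC_NAMES = {v: k for k, v in _SYNC_TO_LEGACY_METRIC_NAMES.items()}

def normalize_metric(metric_name: str, use_legacy_endpoint: bool) -> str:
    """Return the name the chosen endpoint expects, regardless of caller input."""
    if use_legacy_endpoint:
        return _SYNC_TO_LEGACY_METRIC_NAMES.get(metric_name, metric_name)
    return _LEGACY_TO_SYNC_METRIC_NAMES.get(metric_name, metric_name)
```

Metrics with no entry (e.g. violence) pass through unchanged in both directions.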

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Mar 16, 2026
…feature flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/legacy-endpoint-backwards-compat branch from 2c18b75 to fa45fcc on March 16, 2026 at 22:20
slister1001 and others added 2 commits March 17, 2026 09:19
The 5 new legacy endpoint e2e tests require test proxy recordings
that don't exist yet. Mark them with pytest.mark.skip so CI passes
in playback mode. The tests work correctly in live mode (verified
locally).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/legacy-endpoint-backwards-compat branch from 01cee1d to 876e9e5 on March 17, 2026 at 13:56
slister1001 and others added 2 commits March 17, 2026 10:28
- Record 5 new legacy endpoint e2e tests (pushed to azure-sdk-assets)
- Fix PROXY_URL callable check in conftest.py for local recording compat
- Fix missing request.getfixturevalue() in test_self_harm_evaluator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These files import azure.ai.evaluation.red_team which requires pyrit,
causing ImportError in CI environments without the redteam extra.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/legacy-endpoint-backwards-compat branch from b02430f to f849593 on March 17, 2026 at 17:14
slister1001 and others added 2 commits March 17, 2026 13:50
…ests

- Map groundedness -> generic_groundedness for legacy annotation endpoint
- Set metric_display_name to preserve 'groundedness' output keys
- Add e2e tests for ALL evaluators with _use_legacy_endpoint=True:
  GroundednessPro, ProtectedMaterial, CodeVulnerability, IndirectAttack,
  UngroundedAttributes, ECI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace if/elif chains with _SYNC_TO_LEGACY_METRIC_NAMES dict used
bidirectionally. Adding new metric mappings is now a one-line change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/legacy-endpoint-backwards-compat branch from fadf4a1 to 2f144c7 on March 17, 2026 at 18:51
slister1001 and others added 4 commits March 17, 2026 14:56
The legacy annotation API returns results under 'xpia' and 'eci' keys,
not 'indirect_attack' and 'election_critical_information'. Without this
mapping, parse_response cannot find the metric key in the response and
returns empty dict.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The legacy annotation API returns XPIA results under 'xpia' and ECI
under 'eci', but parse_response looked for 'indirect_attack' and
'election_critical_information'. Add _SYNC_TO_LEGACY_RESPONSE_KEYS
fallback lookup in both parse_response and _parse_content_harm_response.

Split mapping into two dicts:
- _SYNC_TO_LEGACY_METRIC_NAMES: metrics where the API request name differs
- _SYNC_TO_LEGACY_RESPONSE_KEYS: superset including response key differences

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ECIEvaluator uses _InternalEvaluationMetrics.ECI = 'election_critical_information'
as metric_display_name, so output keys are election_critical_information_label,
not eci_label.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 marked this pull request as ready for review March 17, 2026 22:12
@slister1001 slister1001 requested a review from a team as a code owner March 17, 2026 22:12
Copilot AI review requested due to automatic review settings March 17, 2026 22:12
Contributor

Copilot AI left a comment


Pull request overview

This PR restores backward compatibility for the _use_legacy_endpoint=True feature flag across the Azure AI Evaluation RAI-service-backed evaluators and red team scoring, primarily by normalizing metric names between the legacy annotation endpoint and the newer sync_evals endpoint, and by aligning legacy conversation behavior with pre-sync-migration semantics.

Changes:

  • Add bidirectional metric/request/response-key normalization in rai_service.py and update evaluator routing to skip double-parsing for legacy responses.
  • Restore legacy whole-conversation evaluation behavior (single-call) and extend red team scorer/processor to pass endpoint choice and evaluator name for telemetry.
  • Add/extend unit + e2e tests for legacy endpoint parity (query/response, conversation, and metric mapping).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Files changed:

  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_legacy_endpoint_compat.py: Adds focused unit tests for legacy endpoint behavior and parsing paths.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_content_safety_rai_script.py: Adds routing/mapping unit tests for sync↔legacy metric normalization.
  • sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_builtin_evaluators.py: Adds e2e coverage for legacy endpoint across multiple built-in evaluators and output key parity.
  • sdk/evaluation/azure-ai-evaluation/tests/conftest.py: Makes proxy URL handling robust when PROXY_URL is not callable.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_utils/metric_mapping.py: Updates documentation comment to reflect routing-layer normalization.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py: Adds use_legacy_endpoint plumb-through to scoring calls and removes local metric remapping.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_evaluation_processor.py: Removes dead metric mapping logic and adds evaluator_name for telemetry.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_rai_svc_eval.py: Adjusts conversation/query-response routing for legacy vs sync endpoint behavior.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_common/rai_service.py: Introduces centralized sync↔legacy metric and response-key mappings; adds endpoint-based normalization.
  • sdk/evaluation/azure-ai-evaluation/assets.json: Updates assets tag reference.
Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_common/rai_service.py:1333

  • evaluate_with_rai_service_sync_multimodal maps some sync metric names to legacy names when use_legacy_endpoint=True, then calls evaluate_with_rai_service_multimodal, which ultimately calls parse_response(..., metric_name) with no metric_display_name. That means the returned dict keys will be based on the legacy metric name (e.g., hate_fairness_*), not the canonical sync name (hate_unfairness_*), breaking key-parity/back-compat. Mirror the metric_display_name approach used in evaluate_with_rai_service_sync (e.g., plumb a display name through the multimodal legacy path or post-process keys before returning).
    # Normalize metric name (same mapping as evaluate_with_rai_service_sync)
    metric_name_str = metric_name.value if hasattr(metric_name, "value") else metric_name
    if use_legacy_endpoint:
        legacy_name = _SYNC_TO_LEGACY_METRIC_NAMES.get(metric_name_str)
        if legacy_name:
            metric_name = legacy_name
    else:
        _legacy_to_sync = {v: k for k, v in _SYNC_TO_LEGACY_METRIC_NAMES.items()}
        sync_name = _legacy_to_sync.get(metric_name_str)
        if sync_name:
            metric_name = sync_name

    # Route to legacy endpoint if requested
    if use_legacy_endpoint:
        return await evaluate_with_rai_service_multimodal(
            messages=messages,
            metric_name=metric_name,
            project_scope=project_scope,
            credential=credential,
        )



- Define _LEGACY_TO_SYNC_METRIC_NAMES at module level (avoid rebuilding on every call)
- Fix assertion in test to match string type (not enum)
- Remove unused @patch decorator and cred_mock parameter
- Delete test_legacy_endpoint_compat.py entirely
- Fix effective_metric_name NameError in _evaluation_processor.py lookup_names
- Route legacy conversation through sync wrapper for metric normalization
- Remove unused evaluate_with_rai_service_multimodal import

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 nagkumar91 left a comment


Code Review

🔴 Legacy conversation path returns wrong format (Critical)

_evaluate_conversation() legacy path (lines 140-147 of _base_rai_svc_eval.py) returns the raw result from evaluate_with_rai_service_sync_multimodal() directly — a flat dict like {"violence": 0, "violence_score": 0, ...}. But conversation evaluation is expected to return an evaluation_per_turn structure built by _aggregate_results(). The legacy path bypasses _aggregate_results() entirely.

The e2e test test_content_safety_evaluator_conversation_with_legacy_endpoint asserts:

assert "evaluation_per_turn" in score
assert len(score["evaluation_per_turn"]["violence"]) == 2

This will fail since the flat dict from the legacy endpoint has no evaluation_per_turn key. Either the legacy path needs to wrap its result through _aggregate_results(), or the test expectations need to match the actual legacy output format.
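
A rough sketch of the shape being asked for here (assumed structure only; the SDK's `_aggregate_results` is more involved than this fold):

```python
def aggregate_results(per_call_results):
    """Fold flat per-call result dicts into an evaluation_per_turn structure."""
    per_turn = {}
    for result in per_call_results:
        for key, value in result.items():
            per_turn.setdefault(key, []).append(value)
    return {"evaluation_per_turn": per_turn}

# Legacy path: one whole-conversation call, so each list has one entry.
score = aggregate_results([{"violence": 0, "violence_score": 0}])
```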


🟡 Missing metric_display_name in multimodal path (High)

evaluate_with_rai_service_sync() correctly sets metric_display_name = metric_display_name or metric_name_str before remapping to the legacy name, so output keys remain hate_unfairness_*. However, evaluate_with_rai_service_sync_multimodal() does NOT set metric_display_name:

# evaluate_with_rai_service_sync (correct):
if legacy_name:
    metric_display_name = metric_display_name or metric_name_str  # ✅ preserves original name
    metric_name = legacy_name

# evaluate_with_rai_service_sync_multimodal (missing):
if legacy_name:
    metric_name = legacy_name  # ❌ no metric_display_name preservation

This means legacy multimodal results will produce keys under the legacy name (hate_fairness_*) instead of the expected name (hate_unfairness_*).

Fix: Add metric_display_name = metric_display_name or metric_name_str in the multimodal function before metric_name = legacy_name.


🟡 Duplicated normalization logic (Medium)

The 10-line metric normalization block is copy-pasted in both evaluate_with_rai_service_sync() and evaluate_with_rai_service_sync_multimodal(). This duplication already caused bug #2 above (missing metric_display_name in one copy).

Fix: Extract to a helper:

def _normalize_metric_for_endpoint(metric_name, use_legacy_endpoint, metric_display_name=None):
    metric_name_str = metric_name.value if hasattr(metric_name, "value") else metric_name
    if use_legacy_endpoint:
        legacy_name = _SYNC_TO_LEGACY_METRIC_NAMES.get(metric_name_str)
        if legacy_name:
            return legacy_name, (metric_display_name or metric_name_str)
    else:
        sync_name = _LEGACY_TO_SYNC_METRIC_NAMES.get(metric_name_str)
        if sync_name:
            return sync_name, metric_display_name
    return metric_name, metric_display_name

🟡 Response key fallback applies unconditionally (Medium)

The legacy response key fallback in parse_response() and _parse_content_harm_response() applies even when NOT using the legacy endpoint:

if metric_name not in batch_response[0]:
    legacy_key = _SYNC_TO_LEGACY_RESPONSE_KEYS.get(...)
    if legacy_key and legacy_key in batch_response[0]:
        response_key = legacy_key  # runs even for sync endpoint

If the sync endpoint returns an unexpected format, this silently falls back to checking legacy keys instead of failing fast. This could mask API contract violations.

Fix: Thread use_legacy_endpoint through to these functions and only apply the fallback when use_legacy_endpoint=True.
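
The guarded fallback suggested here could look roughly like this (hypothetical standalone shape; the xpia/eci pairs are the ones named in the commit messages above, and `resolve_response_key` is illustrative, not the SDK's function):

```python
# Superset of name differences that also show up as response keys.
_SYNC_TO_LEGACY_RESPONSE_KEYS = {
    "indirect_attack": "xpia",
    "election_critical_information": "eci",
}

def resolve_response_key(metric_name, response, use_legacy_endpoint):
    """Pick the key to read from the service response, failing fast for sync."""
    if metric_name in response:
        return metric_name
    if use_legacy_endpoint:  # only the legacy endpoint uses the old key names
        legacy_key = _SYNC_TO_LEGACY_RESPONSE_KEYS.get(metric_name)
        if legacy_key and legacy_key in response:
            return legacy_key
    raise KeyError(f"metric key {metric_name!r} not found in response")
```

With the flag threaded through, an unexpected sync-endpoint response raises instead of silently matching a legacy key.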

- Extract _normalize_metric_for_endpoint() helper (fixes duplication + ensures
  metric_display_name is set in both sync and multimodal paths)
- Fix legacy conversation path to produce evaluation_per_turn structure by
  wrapping result through _aggregate_results()
- Add comments clarifying response key fallback is inherently legacy-only
  (parse_response is only called from legacy endpoint functions)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 nagkumar91 left a comment


Thanks for addressing the prior review. The helper extraction and _aggregate_results wrapping look good. Two remaining issues:

🔴 Conversation legacy test asserts wrong per-turn count (will fail)

test_content_safety_evaluator_conversation_with_legacy_endpoint asserts:

assert len(score["evaluation_per_turn"]["violence"]) == 2
assert len(score["evaluation_per_turn"]["violence_score"]) == 2

But the legacy path sends the entire conversation in a single call, then wraps the single result via _aggregate_results([result]). That produces evaluation_per_turn lists of length 1 (one aggregated result), not 2. The sync path evaluates per-turn (2 turns → length 2), but the legacy path intentionally evaluates as a single batch.

Fix: Either change assertions to == 1 (matching actual legacy behavior), or evaluate per-turn in legacy mode too (but that would contradict the stated goal of matching pre-migration behavior).


🟡 metric_display_name still not propagated in multimodal legacy path (High)

evaluate_with_rai_service_sync_multimodal discards the display name:

metric_name, _ = _normalize_metric_for_endpoint(metric_name, use_legacy_endpoint)

Downstream, evaluate_with_rai_service_multimodal() calls parse_response(annotation_response, metric_name) without metric_display_name, so output keys use the legacy name (hate_fairness_score, hate_fairness_reason) instead of the expected sync name (hate_unfairness_score, hate_unfairness_reason).

This affects HateUnfairnessEvaluator and GroundednessProEvaluator in conversation mode with legacy endpoint. It's untested because the only conversation+legacy test uses ViolenceEvaluator (no remapping needed).

Fix: Thread metric_display_name through to evaluate_with_rai_service_multimodal() and pass it to parse_response(). This requires adding the parameter to the multimodal function signature.

- Fix conversation legacy test: assert per-turn length == 1 (not 2), since
  legacy sends entire conversation as single call
- Thread metric_display_name through evaluate_with_rai_service_multimodal
  so legacy multimodal results use correct output key names (e.g.
  hate_unfairness_* not hate_fairness_*)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 nagkumar91 left a comment


All four issues from previous reviews are resolved:

  1. Duplicated normalization → Extracted to _normalize_metric_for_endpoint() helper ✅
  2. metric_display_name in multimodal → Now threaded through the full chain: _normalize_metric_for_endpoint → evaluate_with_rai_service_sync_multimodal → evaluate_with_rai_service_multimodal (new param) → parse_response ✅
  3. Conversation legacy format → Wrapped via _aggregate_results([result]); test correctly asserts == 1 per-turn entries with explanatory comment ✅
  4. Response key fallback scope → Comments clarify parse_response is inherently legacy-only ✅

LGTM.

…_to_eval_input

The parent class _convert_kwargs_to_eval_input decomposes text conversations
into per-turn {query, response} pairs before _do_eval is called, routing to
_evaluate_query_response instead of _evaluate_conversation. This bypasses the
legacy single-call logic entirely.

Override _convert_kwargs_to_eval_input in RaiServiceEvaluatorBase to pass
conversations through intact when _use_legacy_endpoint=True, so
_evaluate_conversation is reached and sends all messages in one API call.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
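
The override this commit describes can be sketched like so. The class and method names follow the thread, but the real base-class signature and per-turn decomposition logic differ; this stand-in only illustrates the branching:

```python
class RaiServiceEvaluatorBaseSketch:
    """Illustrative stand-in for RaiServiceEvaluatorBase's input conversion."""

    def __init__(self, use_legacy_endpoint=False):
        self._use_legacy_endpoint = use_legacy_endpoint

    def _convert_kwargs_to_eval_input(self, conversation):
        if self._use_legacy_endpoint:
            # Pass the conversation through intact: one whole-conversation call.
            return [conversation]
        # Sync path: decompose into per-turn {query, response} pairs
        # (assumes alternating user/assistant messages, for illustration).
        messages = conversation["messages"]
        return [
            {"query": q["content"], "response": r["content"]}
            for q, r in zip(messages[::2], messages[1::2])
        ]
```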
Move validate_conversation() call after the legacy endpoint check since it
requires multimodal (image) content. Text conversations routed through the
legacy path don't need this validation.

Re-recorded test_content_safety_evaluator_conversation_with_legacy_endpoint
in live mode and pushed new recordings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>